Real-time 3D gesture recognition and tracking system for mobile devices

ABSTRACT

The disclosure relates to a device and a method in the device for recognizing a 3D gesture. The device is connected to a sensor and has access to a database of gesture images including indexable features of normalized gesture images. The indexable features include a position and an orientation for each pixel of edge images of the normalized gesture images. The method includes capturing (110) an image of the 3D gesture via the sensor, normalizing (120) the captured image, deriving (130) indexable features from the normalized captured image, and comparing (140) the derived indexable features with the indexable features of the database using a similarity function. The method also includes determining (150) a gesture image in the database matching the 3D gesture based on the comparison.

TECHNICAL FIELD

The disclosure relates to gesture recognition, and more specifically to a device and a method for recognizing a 3D gesture.

BACKGROUND

The human hand has 27 degrees of freedom (DoF): four in each finger (three for extension and flexion and one for abduction and adduction); the thumb is more complicated and has five DoF, leaving six DoF for the rotation and translation of the wrist. Capturing hand and finger motion in video sequences is a highly challenging task due to the large number of DoF of the hand kinematics. This process is even more complicated on hand-held smart devices due to their limited power and the expense of the required computations.

Basically, the common existing solutions follow the steps illustrated in FIG. 1. The query image sequence captured by sensor/s is analyzed to segment the user's hand/fingers. Image analysis algorithms, such as background removal, classification, feature detection etc., are utilized to detect the hand/fingers. In fact, existing algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model based approaches (US2010053151A1, US2010159981A1, WO2012135545A1, and US2012062558A1).

The former are based on a direct comparison of hand gestures with 2D image features. Popular image features used to detect human gestures include hand colors and shapes, local hand features and so on. The drawback of feature-based approaches is that clean image segmentation is generally required in order to extract the hand features. This is not a trivial task when the background is cluttered, for instance. Furthermore, human hands are highly articulated. It is often difficult to find local hand features due to self-occlusion, and some kind of heuristics is needed to handle the large variety of hand gestures.

Instead of employing 2D image features to represent the hand directly, 3D hand model based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera. Generally, it is easier to achieve real-time performance with appearance-based approaches due to their simpler 2D image features. However, this type of approach can only handle simple hand gestures, like detection and tracking of fingertips. In contrast, 3D hand model based approaches offer a rich description that potentially allows a wide class of hand gestures. The main challenge is that the 3D hand is a complex 27-DoF deformable model. To cover all the characteristic hand images under different views, a very large database is thus required. Matching the query images from the video input against all hand images in the database is time-consuming and computationally expensive. This is why most existing 3D hand model based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.

SUMMARY

It is an object to address some of the problems outlined above, and to provide a solution for computationally efficient real-time gesture recognition. This object and others are achieved by the method and the device according to the independent claims, and by the embodiments according to the dependent claims.

In accordance with a first aspect, a method for recognizing a 3D gesture is provided. The method is performed in a device having access to a database of gesture images. The device communicates with a sensor adapted to capture an image of the 3D gesture. The database of gesture images comprises indexable features of normalized gesture images. The indexable features comprise a position and an orientation for each pixel of edge images of the normalized gesture images. The method comprises capturing an image of the 3D gesture via the sensor, and normalizing the captured image in accordance with the normalized gesture images of the database. The method also comprises deriving indexable features from the normalized captured image. The indexable features comprise a position and an orientation for each pixel of an edge image of the normalized captured image. The method further comprises comparing the derived indexable features with the indexable features of the database using a similarity function, and determining a gesture image in the database matching the 3D gesture based on the comparison.

In accordance with a second aspect, a device for recognizing a 3D gesture is provided. The device is configured to have access to a database of gesture images comprising indexable features of normalized gesture images. The indexable features comprise a position and an orientation for each pixel of edge images of the normalized gesture images. The device is connectable to a sensor adapted to capture an image of the 3D gesture. The device comprises a processing unit. The processing unit is adapted to capture the image of the 3D gesture via the sensor, normalize the captured image in accordance with the normalized gesture images of the database, and derive indexable features from the normalized captured image. The indexable features comprise a position and an orientation for each pixel of an edge image of the normalized captured image. The processing unit is also adapted to compare the derived indexable features with the indexable features of the database using a similarity function. The processing unit is further adapted to determine a gesture image in the database matching the 3D gesture based on the comparison.

An advantage of embodiments is that high-resolution gesture recognition is made possible in real time with fewer computational resources.

Other objects, advantages and features of embodiments will be explained in the following detailed description when considered in conjunction with the accompanying drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram schematically illustrating a method for gesture tracking and recognition according to prior art.

FIG. 2A is a schematic, pictorial illustration of a 3D user interface system on a mobile platform, in accordance with embodiments of the present invention.

FIG. 2B is a schematic, pictorial illustration of the 3D user interface employing a wearable device, in accordance with embodiments of the present invention.

FIG. 2C is a schematic, pictorial illustration of the 3D user interface on a stationary platform, in accordance with embodiments of the present invention.

FIG. 3 schematically illustrates a method and a system according to embodiments of the present invention.

FIG. 4 is a flow chart schematically illustrating a method for storing gesture entries in a database according to embodiments of the present invention.

FIG. 5 is a flow chart schematically illustrating a method for searching the gesture entries and finding the match for a query input according to embodiments of the present invention.

FIG. 6 is a flow chart schematically illustrating a method for image query processing according to embodiments of the present invention.

FIG. 7 schematically illustrates a method for the interface level according to embodiments of the present invention.

FIG. 8 schematically illustrates the mobile device 20 shown in FIG. 2A according to embodiments.

FIG. 9 schematically illustrates the wearable device 20 shown in FIG. 2B according to embodiments.

FIG. 10 schematically illustrates the stationary device 20 shown in FIG. 2C according to embodiments.

FIGS. 11a-11b schematically illustrate the method performed by the device according to embodiments.

FIG. 12 schematically illustrates the device according to embodiments.

DETAILED DESCRIPTION

Overview

3D gesture recognition is a highly desired feature in interaction design between humans and future mobile devices. Specifically, in virtual or augmented reality environments, intuitive interaction with the physical world seems unavoidable, and 3D gestural interaction might be the most effective alternative to current input facilities such as track pads and touchscreens. In embodiments of the invention, a solution for 3D gesture recognition and tracking is provided. The proposed methodology and system are based on match finding in an extremely large gesture database. This database includes captured entries of various types of hand gestures, with all the possible variations in rotation and positioning, together with the corresponding position/orientation parameters. A similarity analysis of the attributes between the query inputs and the database entries is performed. The system retrieves the match, comprising the database entry and its annotated information, for the acquired query input.

Unlike the classical computer vision approaches, which require great amounts of power, computation and memory, a new framework is defined to solve the same problem using a totally different approach. The proposed technology can handle the complexity of e.g. the high-DoF hand motion with a large-scale search framework, whereas the current technology is limited to low-resolution gesture recognition and tracking.

For general mobile device applications, the full range of hand/body gestures needs to be covered. To handle the challenging exhaustive search problem in the high-dimensional space of human gestures, an efficient indexing algorithm for large-scale search on gesture images is proposed. The advantage of the disclosed system is the extremely fast retrieval over a huge number of database images, which can handle the high-DoF hand motion in various lighting conditions, in the presence of noise and clutter. The solution is adapted to the special requirements of mobile applications, such as real-time operation, low complexity and robustness, as well as high-resolution tracking and accuracy.

According to embodiments of the invention, any mobile, wearable, or stationary device equipped with vision sensors or other types of sensors, such as a mobile camera, a web-cam, a depth sensor, or an ultra-sound sensor, is enabled to determine or recognize human gestures, e.g. hand, head, or body gestures, in 3D space. Gesture tracking is performed using the determined or recognized gestures in a sequence of query inputs. The recognition and tracking are based on an advanced search system searching in an extremely large database (DB) of annotated gesture entries. The database includes all the possible hand gestures with all the deformations and variations in 3D space, which may correspond to millions of entries. At each moment, for any query gesture, the proposed system automatically searches through the database and retrieves the best match. This results in real-time 3D gesture tracking. The technology facilitates user-device interaction in real-time applications where intuitive 3D interaction might be used. Embodiments of the invention are designed to support the interaction on mobile/wearable devices such as smartphones and augmented reality glasses. They can also be used for stationary, mobile, and other digital devices.

FIG. 3 illustrates an embodiment of the present invention, including a methodology and a system applicable to smartphones, mobile devices, wearable smart devices, stationary systems and digital gadgets. It includes four main components: the pre-processed annotated and indexed gesture database 50; the image query processing unit 30, which receives a query gesture; a real-time gesture search engine 70, which receives the query gestures and automatically retrieves the best match from the database of gestures; and finally, the interface level 90, which receives the output of the search engine and applies it to the ongoing application. The required hardware platform is any digital device.

System Description

FIG. 2A is a schematic illustration of a 3D user interface system 200A, in accordance with an embodiment of the present invention. The user interface is based on a smart device 20 of any kind (mobile, stationary, wearable etc.) equipped with sensor/s 34 of any type (e.g. 2D/3D camera, ultra-sonic, 3D depth camera, IR camera), which captures 3D scene information behind, in front of, and/or around the device 20, including a gesture 32, e.g. a hand, head, or body gesture of a human user 10. In order to detect/recognize the gesture 32 (hand/head/body gesture), smart device 20 captures the gesture images with sufficient resolution to enable the gesture 32 (hand/head/body gesture) and its specific position and orientation to be extracted. A position represents the spatial coordinates of the gesture center (x,y) in the image plus the gesture scale (distance from the sensor in z), and the orientation is the relative orientation of the hand gesture with regard to the sensor's 3D coordinates (x,y,z). In addition to the gesture 32 (hand/head/body gesture), the captured image or query image 33 typically includes other body parts and/or cluttered background.

In FIG. 2A, system 200A captures and processes a sequence of query images 33 containing the user's gesture 32 (hand/head/body gesture). While the user 10 performs a gesture 32 (hand/head/body gesture), system 200A tracks the user's gesture 32 (hand/head/body gesture) over the sequence of query images 33. Software running on a processing unit 24 in device 20 and/or the capturing sensor 34 processes the image sequence to retrieve indexable features 36 of user gestures 32 in each query image 33, as explained in detail herein below. The software matches the extracted indexable features 36 to a large-scale vocabulary table of indexed features 72 in order to find the best match for the query images 33, as explained in detail herein below. The large-scale vocabulary table is a large-scale matrix of indexable features from the database images.

Database 52 is composed of millions of images of hand gestures. Hand gesture images are annotated with specific 3D motion parameters (three position and three orientation parameters) 58, as explained in detail herein below. Finding the best hand gesture image in database 52 for the query input 33 provides the 3D motion parameters of the query input 33.

The method illustrated in FIG. 5 also analyzes gesture maps 73 over multiple frames in the sequence in order to optimize and speed up the search process, as described herein below.

The system may also include motion tracking functions to track user gestures 32 over a sequence of query inputs 33, so that the method illustrated in FIG. 5 may optionally be performed only once in every two (or more) frames.

Detected/recognized output/s (action/gesture/3D motion, annotated image, . . . ) 92 are provided via an Application Programming Interface (API) to an application program running on device 20. This program may, for example, move and modify images, 3D objects, or other 2D/3D visual content 94 presented on display 100 in response to the performed gesture/s 32.

As an alternative, all or some of these processing functions may be carried out by a suitable processor that is integrated with any other computerized device, such as a game console, media player, smart TV etc. Any computerized apparatus equipped with a capture sensor 34 (2D/3D camera, IR sensor, ultra-sonic etc.), a storing unit 22, and a processing unit 24 can utilize at least some of the mentioned functions to provide a better user interface system.

Providing the Database of Gesture Images

FIG. 4 is a diagram of the method 50 for forming the indexable features 54 of the database of annotated gesture images 52. The database contains a large set of different real images 56 of hand gesture entries with all the potential variations in orientation, positioning and scaling. It can also include hand gesture graphics 57 synthesized by 3D articulated hand models/3D graphical models etc. with known position and orientation parameters.

Besides the matching between the query input 33 and the database, one important aim is to retrieve the 3D motion parameters (three position and three orientation parameters corresponding to the three dimensions) for the query input 33. Since query inputs 33 do not contain the 3D motion parameters (three orientation and three position parameters), the best solution is to associate the 3D motion parameters (three orientation and three position parameters) of the best retrieved match from the database with the query input 33. For this reason, the database entries are tagged with their ground-truth 3D motion parameters (three orientation and three position parameters) 58. This can be done by means of any motion capture system, such as vision-based systems, magnetic sensors, IMUs etc. Other sources of gesture entries 59 are also used to expand the database. By tagging the 3D motion parameters (three orientation and three position parameters) to the hand gesture images, a database of annotated gesture images 52 is formed. Each entry in the database of annotated gesture images 52 represents a pure gesture entry (background and noise free). The method 50 extracts indexable features 54 of each entry in the database of annotated gesture images 52. Indexable features 54 include low-level edge orientation attributes, comprising the exact position and orientation of the edge pixels derived from the entries in the database of annotated gesture images 52. If each single edge pixel is considered as a small line in the 2D image coordinates, the orientation of the edge pixel is the angle of this small line with respect to the origin of the image coordinates. Technically, it can be computed from the gradient of the image with respect to the x and y directions.
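
As an illustration of how these low-level attributes can be computed, the following Python sketch derives per-edge-pixel positions and orientation bins from a grayscale image using OpenCV's Canny and Sobel operators. The number of orientation bins and the Canny thresholds are assumptions of the sketch; the disclosure does not fix their values.

```python
import cv2
import numpy as np

def extract_indexable_features(gray, num_orientations=8):
    """Sketch: derive (row, col, orientation-bin) triples for the edge
    pixels of a normalized grayscale image. The orientation bin count
    (L in the text) and the Canny thresholds are illustrative assumptions.
    """
    edges = cv2.Canny(gray, 100, 200)          # binary edge image
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)     # image gradient w.r.t. x
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)     # image gradient w.r.t. y
    theta = np.arctan2(gy, gx)                 # gradient angle in [-pi, pi]
    # The edge direction is perpendicular to the gradient; fold into [0, pi).
    edge_angle = (theta + np.pi / 2.0) % np.pi
    bins = (edge_angle / np.pi * num_orientations).astype(int) % num_orientations
    rows, cols = np.nonzero(edges)
    return list(zip(rows.tolist(), cols.tolist(), bins[rows, cols].tolist()))
```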

In order to extract the indexable features 54, all the entries in the database of annotated gesture images 52 are normalized and their corresponding edge images are computed. An edge image may be computed by filtering the gesture image. Different edge detectors are known in the computer vision field and can be used as well. Each single edge pixel is represented by its position and orientation. In order to provide a global structure for the low-level edge orientation features, a large-scale vocabulary table 72 is formed to represent all the possible cases in which each edge feature might occur. Considering the whole database with respect to the position and orientation of the edges, the large-scale vocabulary table 72 can represent the whole vocabulary of the gestures in edge pixel format. An edge pixel format is a representation of each pixel of an edge image in terms of its position and orientation.

Image Query Processing

FIG. 6 is a diagram that schematically illustrates the method for image query processing 30. A query input 33 features a gesture 32 (hand/head/body gesture) of user 10, with its specific three position and three orientation parameters, captured by the sensor/s 34 (2D/3D camera, IR sensor, ultra-sonic etc.). Sensor/s 34 captures 3D scene information behind or in front of the device 20. Smart device 20 captures a sequence of query inputs 33 and processes them to retrieve indexable features 36. The method 30 extracts indexable features 36 from the query inputs 33. Indexable features 36 include low-level edge orientation attributes, comprising the exact position and orientation of the edge pixels derived from the query inputs 33.

In order to extract the indexable features 36, the query input 33 is normalized and its corresponding edge image is computed. Each single edge pixel is represented by its position and orientation.

Basically, a query input 33 that captures a user gesture 32 (hand/head/body gesture) contains cluttered background caused by irrelevant objects, environmental noise, etc. Thus, the indexable features 36 retrieved from query inputs 33 contain features from both the gesture 32 and the noisy background. On the other hand, each entry in the database of annotated gesture images 52 represents a pure gesture entry (background and noise free); thus, the indexable features 54 retrieved from each entry in the database of annotated gesture images 52 represent only the features of the pure gesture. Therefore, the edge image of the query image cannot be defined as exactly as the edge images of the database images.

Gesture Search Engine

FIG. 5 illustrates the method for the gesture search engine 70. The extracted indexable features 54 of each entry in the database of annotated gesture images 52 build a large-scale vocabulary table of indexable features 72 in the gesture search engine 70.

The large-scale vocabulary table of indexed features 72 is formed to represent all the possible cases in which each edge feature might occur. Considering the whole database with respect to the position and orientation of the edges, the large-scale vocabulary table 72 can represent the whole vocabulary of the gestures in edge pixel format. For instance, for an image size of p*q pixels, an L-level edge orientation representation, and a database of N gesture images, the vocabulary table 72 will have p*q*L columns and N rows. The vocabulary table 72 is then filled with the indices of all the database images 52 that have features at the specific rows and columns. The vocabulary table 72 thus collects the required information from the whole database 52, which is essential in the method for the gesture search engine 70.
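
A minimal sketch of such a table follows, stored as an inverted index from a (position, orientation) column to the database images that have an edge feature there. Using a sparse mapping rather than a dense p*q*L by N matrix is an assumption of the sketch, made for memory efficiency; the text describes the table abstractly as a matrix.

```python
from collections import defaultdict

def build_vocabulary_table(database_features, q, num_orientations):
    """Sketch of the vocabulary table 72: each (row, col, orientation)
    combination is one of the p*q*L columns; the table records which of
    the N database images have an edge feature in that column.
    `database_features` maps an image index to its feature triples,
    e.g. the output of extract_indexable_features() above.
    """
    table = defaultdict(list)   # column key -> indices of database images
    for image_index, features in database_features.items():
        for row, col, ori in features:
            column = (row * q + col) * num_orientations + ori
            table[column].append(image_index)
    return table
```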

In order to detect/recognize the user gesture 32 in a query image 33, the large-scale search table 72 and the retrieved indexable features 36 of each query image 33 are utilized by a direct similarity analysis function 75 to select the top m first-level matches in the database of annotated gesture images 52.

Each query input 33 in edge pixel format contains a set of edge points that can be represented by their row-column positions and specific orientations. The direct similarity analysis function 75 computes the similarity of the retrieved indexable features 36 of the query input 33 with the large-scale vocabulary table of indexed features 72, based on the positions and specific orientations of all the edge features. The direct similarity analysis function is a function that assigns a score to a pair of data values, where the score indicates the similarity of the indexed features of the query to the indexed features of each entry in the database. If a certain condition is satisfied for the retrieved indexable features 36 in the query input 33 and the retrieved indexable features 54 of the database of annotated gesture images 52, the direct similarity analysis function 75 assigns +K1 points to all the database images 52 that have an edge with a similar direction at those specific row-column positions. The direct similarity analysis function 75 performs the mentioned process for each single edge pixel format of the query input 33.

The first step of the direct similarity analysis function 75 covers the case where two edge patterns from the query input 33 and the database images 52 exactly cover each other, whereas in most real cases two similar patterns are extremely close to each other in position but do not overlap to a large extent. For these cases, which happen regularly, the direct similarity analysis function 75 assigns extra points based on the first and second level neighbor pixels.

A very probable case is that two extremely similar patterns do not overlap but fall on neighboring pixels of each other. In order to cover these cases, besides the first step of the direct similarity analysis function 75, for any single pixel the first level 8 neighboring and second level 16 neighboring pixels in the database images should be considered for assigning extra points. The first level 8 neighboring pixels of any single pixel are the ones that surround the single pixel. The second level neighbors comprise the 16 pixels that surround the first level 8 neighboring pixels. All the database images 52 that have an edge with a similar direction among the first level and second level neighbors receive +K2 and +K3 points respectively (K1>K2>K3). In short, the direct similarity analysis 75 is performed for all the edge pixels in the query with respect to the similarity to the database images at three levels with different weights. Finally, the accumulated score of each database image is calculated and normalized, and the images with the maximum scores are selected as the top m first-level matches.
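
The voting scheme can be sketched as follows. The three offset sets correspond to the exact hit, the 8 first-level neighbors, and the 16 second-level neighbors described above; the concrete K values and the choice of m are illustrative assumptions, and "similar direction" is approximated here by requiring the same orientation bin.

```python
import numpy as np

def direct_similarity(query_features, table, p, q, num_orientations,
                      n_images, k1=4.0, k2=2.0, k3=1.0, top_m=20):
    """Sketch of the direct similarity analysis 75 (K1 > K2 > K3)."""
    scores = np.zeros(n_images)
    levels = [
        (k1, [(0, 0)]),                                   # exact overlap
        (k2, [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
              if (dr, dc) != (0, 0)]),                    # 8 first-level neighbors
        (k3, [(dr, dc) for dr in range(-2, 3) for dc in range(-2, 3)
              if max(abs(dr), abs(dc)) == 2]),            # 16 second-level neighbors
    ]
    for row, col, ori in query_features:
        for weight, offsets in levels:
            for dr, dc in offsets:
                r, c = row + dr, col + dc
                if not (0 <= r < p and 0 <= c < q):       # stay inside the image
                    continue
                column = (r * q + c) * num_orientations + ori
                for image_index in table.get(column, ()):
                    scores[image_index] += weight
    scores /= max(len(query_features), 1)    # normalize the accumulated scores
    return np.argsort(scores)[::-1][:top_m]  # top m first-level matches
```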

In order to find the closest match among the top m first-level matches, a reverse similarity analysis 76 is performed. Reverse similarity analysis 76 means that besides finding the similarity of the query gesture 32 to the entries of the database of annotated gesture images 52, the reverse similarity of the selected top m entries of the database of annotated gesture images 52 to the query gesture 32 is computed. The reverse similarity function is used for accuracy reasons. Not using the reverse similarity analysis would give lower accuracy of retrieval, but reduces the complexity.

The reverse similarity analysis 76 returns the best n matches (n&lt;m) from the database of annotated images 52 for the given user gesture 32. The combination of the direct similarity analysis 75 and the reverse similarity analysis 76 returns the best match from the database of annotated gesture images 52 for the query input 33.
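
A sketch of this second pass follows. Scoring a candidate by the fraction of its edge pixels that also appear in the query is an illustrative stand-in for applying the similarity function in the reverse direction; the disclosure does not fix the exact reverse scoring.

```python
def reverse_similarity(query_features, database_features, candidates, top_n=5):
    """Sketch of the reverse similarity analysis 76: with the roles swapped,
    measure how well each top-m candidate is explained by the query, and
    keep the best n (n < m).
    """
    query_set = set(query_features)          # (row, col, orientation) triples
    scored = []
    for image_index in candidates:
        feats = database_features[image_index]
        hits = sum(1 for f in feats if f in query_set)
        scored.append((hits / max(len(feats), 1), image_index))
    scored.sort(reverse=True)                # highest reverse score first
    return [idx for _, idx in scored[:top_n]]
```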

Another optional step in the gesture search engine 70 is smoothing of the gesture search by employing a gesture neighborhood analysis function 77. Smoothness means that the retrieved best matches in a sequence of 3D gestural interaction should represent a smooth motion. In order to perform a smooth retrieval, the entries in the database of annotated gesture images 52 are analyzed and mapped to a high-dimensional space to detect gesture maps 73. Gesture maps 73 indicate which gestures are closer to each other and fall in the same neighborhood in the high-dimensional space. Therefore, for a query input 33 in a sequence, after performing the direct similarity analysis function 75, the reverse similarity is computed by the reverse similarity analysis function 76 and the top matches are selected. Afterwards, the method 70 searches the gesture maps 73 to check which of these top matches is closest to the previous frame's match, and the closest entry from the database of annotated images 52 is selected as the final best match. The 3D motion parameters (three position and three orientation parameters) 58 tagged to the best match can then be immediately used to facilitate various application scenarios running on display 100.
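
One way to realize this neighborhood check is sketched below. Representing the gesture map 73 as one embedding vector per database entry is an assumed representation; the disclosure only states that entries are mapped to a high-dimensional space.

```python
import numpy as np

def smooth_best_match(top_matches, previous_match, gesture_map):
    """Sketch of the gesture neighborhood analysis 77: among the top
    matches, pick the entry closest in the gesture map to the previous
    frame's match, so that retrieval over a sequence stays smooth.
    """
    if previous_match is None:               # first frame: no history yet
        return top_matches[0]
    prev = gesture_map[previous_match]
    distances = [np.linalg.norm(gesture_map[m] - prev) for m in top_matches]
    return top_matches[int(np.argmin(distances))]
```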

Interface

FIG. 7 is a flow chart that schematically illustrates the method for the interface level 90, which receives the detection/recognition output (actions/gesture/3D motion, annotated image etc.) 92 of the search engine 70. The detected/recognized parameters (actions/gesture/3D motion, annotated image etc.) 92 are provided via the application programming interface (API) to an application running on device 20. The application may include a 2D/3D video game, 2D/3D object modeling/rendering, photo browsing, a map, navigation etc. presented on display 100. User 10 perceives the output visual contents (2D/3D) 94 on the display 100, which are continuously modified in response to the user's gesture 32 performances.

Detailed Description of Device

FIG. 8 illustrates the mobile device 20 shown in FIG. 2A. The mobile device 20 consists of a storing unit 22, a processing unit 24, a sensor 34 (e.g. 2D/3D camera, IR sensor, ultra-sonic etc.), and a display 100. The sensor 34 captures 3D scene information in front of the device 20. The mobile device 20 may also include a rear sensor 34 (e.g. 2D/3D camera, IR sensor, ultra-sonic etc.) that captures 3D scene information behind the mobile device 20. The mobile device 20 captures a sequence of query inputs 33 and processes them to retrieve indexable features 36. The storing unit 22 stores the database of annotated gesture images 52, the large-scale vocabulary table of indexed features 72, and the gesture maps 73. The processing unit 24 performs the method for image query processing 30 and the method for the search engine 70. The processing unit 24 also modifies the output visual content (2D/3D) 94 presented on the display 100 in response to user gesture 32 performances. The display 100 displays an application running on the mobile device 20. The application may include a 2D/3D video game, 2D/3D object modeling/rendering, photo browsing, a map, navigation etc. presented on the display 100. The user 10 perceives the output visual contents (2D/3D) 94 on the display 100, which are continuously modified in response to user gesture 32 performances.

FIG. 9 illustrates the wearable device 20 shown in FIG. 2B. The wearable device 20 consists of a storing unit 22, a processing unit 24, a sensor 34 (e.g. 2D/3D camera, IR sensor, ultra-sonic etc.), and a display 100. The sensor 34 captures 3D scene information in front of the wearable device 20. The wearable device 20 captures a sequence of query inputs 33 and processes them to retrieve indexable features 36. The storing unit 22 stores the database of annotated gesture images 52, the large-scale vocabulary table of indexed features 72, and the gesture maps 73. The processing unit 24 performs the method for image query processing 30 and the method for the search engine 70. The processing unit 24 also modifies the output visual content (2D/3D) 94 presented on the display 100 in response to user gesture 32 performances. The display 100 displays an application running on the wearable device 20. The application may include a 2D/3D video game, 2D/3D object modeling/rendering, photo browsing, a map, navigation etc. presented on the display 100. The user 10 perceives the output visual contents (2D/3D) 94 on the display 100, which are continuously modified in response to user gesture 32 performances.

FIG. 10 illustrates the stationary device 20 shown in FIG. 2C. The stationary device 20 consists of a storing unit 22, a processing unit 24, a sensor 34 (2D/3D camera, IR sensor, ultra-sonic etc.), and a display 100. The sensor 34 captures 3D scene information in front of the stationary device 20. The stationary device 20 captures a sequence of query inputs 33 and processes them to retrieve indexable features 36. The storing unit 22 stores the database of annotated gesture images 52, the large-scale vocabulary table of indexed features 72, and the gesture maps 73. The processing unit 24 performs the method for image query processing 30 and the method for the search engine 70. The processing unit 24 also modifies the output visual content (2D/3D) 94 presented on the display 100 in response to user gesture 32 performances. The display 100 displays an application running on the stationary device 20. The application may include a 2D/3D video game, 2D/3D object modeling/rendering, photo browsing, a map, navigation etc. presented on the display 100. The user 10 perceives the output visual contents (2D/3D) 94 on the display 100, which are continuously modified in response to user gesture 32 performances.

Method and Device According to Embodiments

The problem of resource-demanding computations together with limited power in devices used for real-time gesture recognition is addressed in embodiments of the invention. FIG. 11a is a flow chart illustrating a method for recognizing a 3D gesture according to embodiments. The method is performed in a device 20 having access to a database 52 of gesture images and communicating with a sensor 34. The sensor 34 is adapted to capture an image 33 of the 3D gesture. The sensor may be an integrated part of the device or it may be a separate sensor connectable to the device. The database 52 of gesture images comprises indexable features 54 of normalized gesture images, the indexable features comprising a position and an orientation for each pixel of edge images of the normalized gesture images. The device may comprise a storing unit 22 for storing the database 52, or it may comprise an interface unit for communicating with a remote database node storing the database 52, e.g. via the internet. The method comprises:

-   110: Capturing an image 33 of the 3D gesture via the sensor 34. In embodiments, capturing the image may comprise capturing a sequence of images of the 3D gesture. The sequence of images may be used to refine the determining of a matching database image, as will be detailed below.
-   120: Normalizing the captured image. The normalizing may be done in accordance with the normalized gesture images of the database to allow for a comparison. Normalization may comprise resizing the captured image to the size of the database images. The database entries are typically normalized to a standard image size such as 320*240 pixels or 640*480 pixels, and therefore the captured image may be normalized to the specific size of the database entries.
-   130: Deriving indexable features 36 from the normalized captured image 33. The indexable features 36 comprise a position and an orientation for each pixel of an edge image of the normalized captured image.
-   140: Comparing the derived indexable features 36 with the indexable features 54 of the database using a similarity function.
-   150: Determining a gesture image in the database 52 matching the 3D gesture based on the comparison. An end-to-end sketch of these steps follows the list.
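
Tying the steps together, a hedged end-to-end sketch reusing the helper functions from the earlier sections might look as follows. The `db` container bundling the database features, vocabulary table and gesture map is an assumption of this sketch, not a structure named by the disclosure.

```python
import cv2

def recognize_3d_gesture(frame, db, previous_match=None):
    """Illustrative glue for steps 110-150."""
    # 120: normalize the captured image to the standard database size,
    # e.g. 320*240 pixels (width 320, height 240).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (320, 240))
    # 130: derive per-edge-pixel position/orientation features.
    query = extract_indexable_features(gray, db.num_orientations)
    # 140: direct similarity gives the top m matches; the optional reverse
    # similarity narrows them to the top n.
    top_m = direct_similarity(query, db.table, p=240, q=320,
                              num_orientations=db.num_orientations,
                              n_images=db.n_images)
    top_n = reverse_similarity(query, db.features, top_m)
    # 150: the optional gesture-map smoothing selects the final match.
    return smooth_best_match(list(top_n), previous_match, db.gesture_map)
```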

One advantage of using indexable features that comprise a position and an orientation for each pixel of an edge image of the normalized captured image is that it allows for a computationally efficient way of recognizing 3D gestures.

FIG. 11b is a flow chart of the method in the device according to another embodiment. The method comprises the steps described above with reference to FIG. 11a. However, the step of comparing 140 the derived indexable features 36 further comprises:

-   141: Using a direct similarity analysis to determine a plurality of gesture images in the database matching the captured image; and
-   142: Using a reverse similarity analysis of the plurality of gesture images to determine a subset of the plurality of gesture images matching the captured image.

In this embodiment, the gesture image in the database 52 matching the 3D gesture is determined 150 to be one of the subset of the plurality of gesture images. However, the step 142 of using the reverse similarity analysis is optional, as already described previously. When not performing the reverse similarity analysis, the gesture image in the database 52 matching the 3D gesture is determined 150 to be one of the plurality of gesture images determined from the direct similarity analysis. The direct and reverse similarity analyses are further described in the subsection "Gesture Search Engine" above. The reverse similarity analysis 76 may be used for accuracy reasons. Although not using the reverse similarity analysis would give a lower accuracy of retrieval, the advantage is that it reduces the complexity.

The flowchart in FIG. 11b also illustrates that the method may further comprise using 160 the determined gesture image matching the 3D gesture to modify a visual content presented on a display, as has been exemplified e.g. in the section "Interface" above.

Two very similar gesture images may not have overlapping edge pixels, but may fall on neighboring pixels of each other. In order to cover these cases, besides the first step of the direct similarity analysis function 75, the first level 8 neighboring and second level 16 neighboring pixels in the database images may be considered when comparing with the captured image. Therefore, in embodiments, the method performed by the device may further comprise:

-   Deriving additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image; and
-   Comparing the derived additional indexable features with additional indexable features of the database using the similarity function.

The gesture image in the database 52 matching the 3D gesture may then be determined based also on the comparison of the additional indexable features.

Furthermore, the gesture image matching the 3D gesture may be determined based on a gesture map indicating gesture images that are close to each other in a sequence of gesture images. The method in the device may further comprise tracking a user gesture based on the sequence of images, and the gesture image in the database matching the 3D gesture may be determined based also on the tracked user gesture.

In any of the embodiments described above, each entry in the database 52 of gesture images may be tagged with associated 3D motion parameters comprising three orientation and three position parameters. The method may therefore also further comprise retrieving the 3D motion parameters associated with the determined gesture image matching the 3D gesture from the database.
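
A possible layout for such a tagged entry, and the corresponding lookup, is sketched below; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AnnotatedGestureEntry:
    """Assumed layout of one database entry: its indexable features 54 plus
    the tagged ground-truth 3D motion parameters 58."""
    features: List[Tuple[int, int, int]]     # (row, col, orientation bin)
    position: Tuple[float, float, float]     # three position parameters
    orientation: Tuple[float, float, float]  # three orientation parameters

def retrieve_motion_parameters(entries, best_match_index):
    """Look up the 3D motion parameters of the determined gesture image."""
    entry = entries[best_match_index]
    return entry.position, entry.orientation
```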

FIG. 12 is a block diagram schematically illustrating the device 20 for recognizing a 3D gesture according to embodiments. The device 20 is configured to have access to a database 52 of gesture images comprising indexable features 54 of normalized gesture images. The indexable features comprise a position and an orientation for each pixel of edge images of the normalized gesture images. The device is connectable to a sensor 34 adapted to capture an image 33 of the 3D gesture. The sensor 34 may be comprised in the device 20, or it may be separate from the device. The device 20 comprises a processing unit 24 adapted to capture the image 33 of the 3D gesture via the sensor, normalize the captured image, and derive indexable features 36 from the normalized captured image 33. The indexable features comprise a position and an orientation for each pixel of an edge image of the normalized captured image. The processing unit 24 is also adapted to compare the derived indexable features 36 with the indexable features 54 of the database using a similarity function, and determine a gesture image in the database 52 matching the 3D gesture based on the comparison.

The processing unit 24 may be further adapted to compare the derived indexable features by using a direct similarity analysis to determine a plurality of gesture images in the database matching the captured image, and to determine the gesture image in the database 52 matching the 3D gesture to be one of the plurality of gesture images.

Furthermore, the processing unit 24 may be further adapted to compare the derived indexable features by using a reverse similarity analysis of the plurality of gesture images to determine a subset of the plurality of gesture images matching the captured image, and to determine the gesture image in the database 52 matching the 3D gesture to be one of the subset of the plurality of gesture images.

In embodiments, the processing unit 24 may be further adapted to derive additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image. The processing unit 24 may be further adapted to compare the derived additional indexable features with additional indexable features of the database using the similarity function, and determine the gesture image in the database 52 matching the 3D gesture based also on the comparison of the additional indexable features.

The processing unit 24 may be further adapted to determine the gesture image matching the 3D gesture based on a gesture map indicating gesture images that are close to each other in a sequence of gesture images. The processing unit 24 may be adapted to capture a sequence of images of the 3D gesture via the sensor 34. In this embodiment, the processing unit may be further adapted to track a user gesture based on the sequence of images, and determine the gesture image in the database 52 matching the 3D gesture based also on the tracked user gesture.

The processing unit 24 may be further adapted to use the determined gesture image matching the 3D gesture to modify a visual content presented on a display. Furthermore, each entry in the database 52 of gesture images may be tagged with associated 3D motion parameters comprising three orientation and three position parameters, and the processing unit 24 may be further adapted to retrieve the 3D motion parameters associated with the determined gesture image matching the 3D gesture from the database 52.

The device 20 may in embodiments comprise a memory containing instructions executable by said processing unit 24, whereby the device is operative to capture the image of the 3D gesture via the sensor, normalize the captured image in accordance with the normalized gesture images of the database, derive indexable features from the normalized captured image, compare the derived indexable features with the indexable features of the database using a similarity function, and determine a gesture image in the database matching the 3D gesture based on the comparison. The device 20 may also comprise an interface circuit connected to the processing unit 24 and configured to communicate with the sensor 34 and/or the database 52.

In an alternative way to describe the embodiment in FIG. 12, the device 20 may comprise means for capturing the image of the 3D gesture via the sensor, means for normalizing the captured image in accordance with the normalized gesture images of the database, means for deriving indexable features from the normalized captured image, means for comparing the derived indexable features with the indexable features of the database using a similarity function, and means for determining a gesture image in the database matching the 3D gesture based on the comparison. The means described are functional units which may be implemented in hardware, software, firmware or any combination thereof. In one embodiment, the means are implemented as a computer program running on a processor.

In still another alternative way to describe the embodiment in FIG. 12, the device 20 may comprise a Central Processing Unit (CPU) which may be a single unit or a plurality of units. Furthermore, the device 20 may comprise at least one computer program product (CPP) in the form of a non-volatile memory, e.g. an EEPROM (Electrically Erasable Programmable Read-Only Memory), a flash memory or a disk drive. The CPP may comprise a computer program, which comprises code means which, when run on the CPU of the device 20, cause the device 20 to perform the methods described earlier in conjunction with FIGS. 11a-b. In other words, when said code means are run on the CPU, they correspond to the processing unit 24 in FIG. 12.

The above mentioned and described embodiments are only given as examples and should not be limiting. Other solutions, uses, objectives, and functions within the scope of the accompanying patent claims may be possible.

The invention claimed is:
1. A method for recognizing a three dimensional, 3D, gesture, the method being performed in a device (20) having access to a database (52) of gesture images, the device communicating with a sensor (34) adapted to capture an image (33) of the 3D gesture, wherein the database (52) of gesture images comprises indexable features (54) of normalized gesture images, the indexable features comprising a position and an orientation for each pixel of edge images of the normalized gesture images, the method comprising: capturing (110) the image (33) of the 3D gesture via the sensor, normalizing (120) the captured image in accordance with the normalized gesture images of the database (52) to allow for a comparison, deriving (130) indexable features (36) from the normalized captured image (33), the indexable features (36) comprising a position and an orientation for each pixel of an edge image of the normalized captured image, comparing (140) the derived indexable features (36) with the indexable features (54) of the database using a similarity function, and determining (150) a gesture image in the database (52) matching the 3D gesture based on the comparison; wherein each entry in the database (52) of gesture images is tagged with associated 3D motion parameters comprising three orientation and three position parameters, the method further comprising: retrieving 3D motion parameters associated with the determined gesture image matching the 3D gesture from the database (52).

2. The method according to claim 1, wherein comparing (140) the derived indexable features further comprises: using (141) a direct similarity analysis to determine a plurality of gesture images in the database matching the captured image, and wherein the gesture image in the database (52) matching the 3D gesture is determined (150) to be one of the plurality of gesture images.

3. The method according to claim 2, wherein comparing (140) the derived indexable features further comprises: using (142) a reverse similarity analysis of the plurality of gesture images to determine a subset of the plurality of gesture images matching the captured image, and wherein the gesture image in the database (52) matching the 3D gesture is determined (150) to be one of the subset of the plurality of gesture images.

4. The method according to claim 1, further comprising: deriving additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, comparing the derived additional indexable features with additional indexable features of the database using the similarity function, and wherein the gesture image in the database (52) matching the 3D gesture is determined based also on the comparison of the additional indexable features.

5. The method according to claim 1, wherein the gesture image matching the 3D gesture is determined based on a gesture map indicating gesture images that are close to each other in a sequence of gesture images.

6. The method according to claim 1, wherein capturing (110) the image comprises capturing a sequence of images of the 3D gesture.

7. The method according to claim 6, further comprising: tracking a user gesture based on the sequence of images, and wherein the gesture image in the database (52) matching the 3D gesture is determined based also on the tracked user gesture.

8. The method according to claim 1, further comprising: using (160) the determined gesture image matching the 3D gesture to modify a visual content presented on a display.

9. A device (20) for recognizing a three dimensional, 3D, gesture, the device being configured to have access to a database (52) of gesture images comprising indexable features (54) of normalized gesture images, the indexable features comprising a position and an orientation for each pixel of edge images of the normalized gesture images, the device being connectable to a sensor (34) adapted to capture an image (33) of the 3D gesture, and the device comprising a processing unit (24) adapted to: capture the image (33) of the 3D gesture via the sensor (34), normalize the captured image in accordance with the normalized gesture images of the database (52) to allow for a comparison, derive indexable features (36) from the normalized captured image (33), wherein the indexable features (36) comprise a position and an orientation for each pixel of an edge image of the normalized captured image, compare the derived indexable features (36) with the indexable features (54) of the database using a similarity function, and determine a gesture image in the database (52) matching the 3D gesture based on the comparison; wherein each entry in the database (52) of gesture images is tagged with associated 3D motion parameters comprising three orientation and three position parameters, the processing unit (24) being further adapted to: retrieve 3D motion parameters associated with the determined gesture image matching the 3D gesture from the database (52).

10. The device (20) according to claim 9, wherein the processing unit (24) is further adapted to compare the derived indexable features by: using a direct similarity analysis to determine a plurality of gesture images in the database matching the captured image, the processing unit (24) being further adapted to determine the gesture image in the database (52) matching the 3D gesture to be one of the plurality of gesture images.

11. The device (20) according to claim 10, wherein the processing unit (24) is further adapted to compare the derived indexable features by: using a reverse similarity analysis of the plurality of gesture images to determine a subset of the plurality of gesture images matching the captured image, the processing unit (24) being further adapted to determine the gesture image in the database (52) matching the 3D gesture to be one of the subset of the plurality of gesture images.

12. The device (20) according to claim 9, wherein the processing unit (24) is further adapted to: derive additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, compare the derived additional indexable features with additional indexable features of the database using the similarity function, and determine the gesture image in the database (52) matching the 3D gesture based also on the comparison of the additional indexable features.

13. The device (20) according to claim 9, wherein the processing unit (24) is further adapted to determine the gesture image matching the 3D gesture based on a gesture map indicating gesture images that are close to each other in a sequence of gesture images.

14. The device (20) according to claim 9, wherein the processing unit (24) is further adapted to capture a sequence of images of the 3D gesture via the sensor (34).

15. The device (20) according to claim 14, wherein the processing unit (24) is further adapted to: track a user gesture based on the sequence of images, and determine the gesture image in the database (52) matching the 3D gesture based also on the tracked user gesture.

16. The device (20) according to claim 9, wherein the processing unit (24) is further adapted to: use the determined gesture image matching the 3D gesture to modify a visual content presented on a display.

17. The method according to claim 2, further comprising: deriving additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, comparing the derived additional indexable features with additional indexable features of the database using the similarity function, and wherein the gesture image in the database (52) matching the 3D gesture is determined based also on the comparison of the additional indexable features.

18. The method according to claim 3, further comprising: deriving additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, comparing the derived additional indexable features with additional indexable features of the database using the similarity function, and wherein the gesture image in the database (52) matching the 3D gesture is determined based also on the comparison of the additional indexable features.

19. The device (20) according to claim 10, wherein the processing unit (24) is further adapted to: derive additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, compare the derived additional indexable features with additional indexable features of the database using the similarity function, and determine the gesture image in the database (52) matching the 3D gesture based also on the comparison of the additional indexable features.

20. The device (20) according to claim 11, wherein the processing unit (24) is further adapted to: derive additional indexable features comprising a position and an orientation for neighbour pixels of each pixel of the edge image from the normalized captured image, compare the derived additional indexable features with additional indexable features of the database using the similarity function, and determine the gesture image in the database (52) matching the 3D gesture based also on the comparison of the additional indexable features.